---
title: "Class Exercise 1"
output:
  pdf_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

We will work with two separate datasets, `LakeHuron` and `Loblolly`. The dataset `LakeHuron` records the annual water level of Lake Huron, one of the five Great Lakes, from 1875 to 1972. `Loblolly` records the height, age, and seed source of loblolly pine trees. “Loblollies” are loblolly pines, a fast-growing species important to the commercial timber industry.

Let’s look at `LakeHuron` first:

```{r}
data(LakeHuron)
LakeHuron
```

Does `LakeHuron` look like any of the data types we have already seen? Why or why not?

`LakeHuron` is actually a time series, an object of class `ts`. Try the following query:

```{r eval=FALSE}
is.ts(LakeHuron)
```

We can confirm this by looking at its class attribute:

```{r eval=FALSE}
attributes(LakeHuron)
```

The `tsp` attribute records when the series starts, when it ends, and its frequency, that is, the number of observations per unit of time (e.g., monthly data on a yearly time scale have a frequency of 12). The `class` attribute identifies the object type. The `plot` command uses both attributes to create a time series plot with the proper time scale. The command itself is quite simple, but it relies on built-in rules for plotting a `ts` object:

```{r eval=FALSE}
plot(LakeHuron)
```
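
The pieces of `tsp` can also be pulled out one at a time. The sketch below uses only base R accessors (`tsp`, `start`, `end`, `frequency`, `time`); it is offered as a quick check on the attributes described above, not as a required step of the exercise.

```{r eval=FALSE}
data(LakeHuron)
tsp(LakeHuron)         # start, end, and frequency in a single vector
start(LakeHuron)       # first time point of the series
end(LakeHuron)         # last time point of the series
frequency(LakeHuron)   # observations per unit of time (1 = yearly)
head(time(LakeHuron))  # the time index that plot() puts on the x-axis
```

Comparing `head(time(LakeHuron))` with the x-axis of the plot is one way to see where the time scale comes from.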

Notice anything unusual in the time series? Lake Huron’s outlet, the St. Clair River, has been extensively dredged over the years, creating a long-term decrease in lake level that apparently leveled off decades ago. Concerns over fluctuating water levels of Lakes Michigan, Superior, and Huron continue to this day.

Let’s try plotting every 5 years’ data. Notice how easily multiple commands can be nested
in R.

```{r eval=FALSE}
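# Use the series length as the upper bound; the hard-coded 100 exceeds the 98 observations
plot(LakeHuron[seq(1, length(LakeHuron), by = 5)])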
```

This plot appears quite different from our earlier time series plot. What has changed? The following command saves the 5-year subset as a time series object that can be plotted more properly. Are any issues still unaddressed? How would you resolve them?

```{r eval=FALSE}
# frequency = 0.2 means one observation every 5 years
plot(ts(LakeHuron[seq(1, length(LakeHuron), by = 5)], start = 1875, frequency = 0.2))
```
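
One way to see what the subsetting did is to compare the object before and after it is rebuilt with `ts`. The sketch below stores the intermediate results under the hypothetical names `idx`, `huron_vec`, `huron5`, and `huron_win`. It also tries `window()` with `deltat = 5`, which should thin a series while keeping its time attributes; treat that as one possible alternative rather than the intended solution.

```{r eval=FALSE}
data(LakeHuron)
idx <- seq(1, length(LakeHuron), by = 5)   # every fifth observation
huron_vec <- LakeHuron[idx]
is.ts(huron_vec)    # indexing with [ returns a plain numeric vector, not a ts

# Rebuild the time attributes by hand, as in the exercise
huron5 <- ts(huron_vec, start = 1875, frequency = 0.2)
tsp(huron5)         # start, end, and frequency of the thinned series

# Alternative: let window() thin the series and carry the attributes along
huron_win <- window(LakeHuron, deltat = 5)
all.equal(as.numeric(huron5), as.numeric(huron_win))
```

If the last line returns `TRUE`, the two approaches select the same observations.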

Next we will work with the `Loblolly` data set.

```{r eval=FALSE}
data(Loblolly)
Loblolly
```

What kind of data set is this? Is it a matrix or a data frame?

```{r eval=FALSE}
is.matrix(Loblolly)
is.data.frame(Loblolly)
```


Since it is not a matrix, you might anticipate that commands commonly used with matrices
would not work. Try these:

```{r eval=FALSE}
dim(Loblolly)
Loblolly[1:5,]
```

Did they work? Clearly, some matrix commands can be applied to data frames.

Next we confirm that `Loblolly$Seed` is a factor. First, type the variable name by itself: does the way R prints the variable provide clues to the data type?

```{r eval=FALSE}
Loblolly$Seed
is.factor(Loblolly$Seed)
```
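
A few base R helpers give a fuller picture of a factor than printing alone. The sketch below uses `class`, `nlevels`, `levels`, `as.integer`, and `table`, all standard functions; the choice of which ones to show is ours, not part of the original exercise.

```{r eval=FALSE}
data(Loblolly)
class(Loblolly$Seed)             # class vector of the Seed column
nlevels(Loblolly$Seed)           # number of distinct seed sources
levels(Loblolly$Seed)            # the labels, in their stored order
head(as.integer(Loblolly$Seed))  # the integer codes stored underneath the labels
table(Loblolly$Seed)             # number of rows recorded for each seed source
```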

Now enter:

```{r eval=FALSE}
names(Loblolly)
```

These names are not particularly descriptive. We can change them if we’d like (not in the `datasets` package, but in our local workspace), then construct a scatterplot for two of the variables. Are the resulting names more satisfactory? What might be a disadvantage?

```{r eval=FALSE}
names(Loblolly) <- c("Height (Ft)", "Age (Yrs)", "Seed Variety")
names(Loblolly)
Loblolly
```
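
The scatterplot mentioned above could be drawn as follows. This is a minimal sketch that works on a copy with the hypothetical name `lob`, so the result does not depend on whether `Loblolly` has already been renamed in your workspace. It also shows how columns whose names contain spaces or parentheses have to be referenced.

```{r eval=FALSE}
data(Loblolly)   # reload the original version of the data
lob <- Loblolly  # hypothetical working copy
names(lob) <- c("Height (Ft)", "Age (Yrs)", "Seed Variety")

# These names are not syntactic, so $ needs backticks around them
mean(lob$`Height (Ft)`)

# Double-bracket indexing with a quoted name works as well
plot(lob[["Age (Yrs)"]], lob[["Height (Ft)"]],
     xlab = "Age (Yrs)", ylab = "Height (Ft)")
```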

**Tidyverse code**

In our class exercises this semester, we will include tidyverse code for students who would like to explore **R** further. We will highlight code in the exercises that would be constructed differently if using the tidyverse packages `dplyr` or `ggplot2`.

As examples, we will replot the original `LakeHuron` time series and rename the variables in `Loblolly`. The plot is actually a poor introduction to `ggplot2`, since it relies on an automated function, `autoplot`, rather than the workhorse function, `ggplot`. In general, the tidyverse is set up to work with data frames and tibbles (the tidyverse version of a data frame), rather than a specialized object such as a time series.

```{r eval=FALSE}
library(dplyr)
library(ggplot2)
library(ggfortify)
autoplot(LakeHuron,ts.colour="red",ylab="Water Level",xlab="Year")
```

We only manipulated a couple of defaults here; what do you think of the graph appearance as compared to the graph produced by the `plot` command?
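
For comparison, the same series can be drawn with `ggplot` directly once it has been converted to a data frame. The sketch below is one possible approach, assuming `ggplot2` is installed; the data frame name `huron_df` and its column names `year` and `level` are our own choices, not part of the dataset.

```{r eval=FALSE}
library(ggplot2)
data(LakeHuron)

# Convert the ts object into an ordinary data frame with one row per year
huron_df <- data.frame(
  year  = as.numeric(time(LakeHuron)),
  level = as.numeric(LakeHuron)
)

# Build the plot layer by layer: data, aesthetic mapping, geometry, labels
p <- ggplot(huron_df, aes(x = year, y = level)) +
  geom_line(colour = "red") +
  labs(x = "Year", y = "Water Level")
p
```

The conversion step is the price of using `ggplot` here: the time index has to become an explicit column before it can be mapped to an axis.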

To subset data, we need to use features in `dplyr` (pronounced dee-plier). Not only are function names different in `dplyr`, but rather than nesting function calls inside parentheses, `dplyr` encourages the use of a *pipe*, the operator `%>%`. In the first example, we look at the first five rows of `Loblolly` using the `slice` command. We could also use the syntax `slice(Loblolly,1:5)`, but have chosen the pipe syntax instead. In the next line, we select every other row using a hybrid of `dplyr` commands (`n()` and `slice`) and regular **R** syntax (the `seq` command). We finish with an alternate method for renaming columns. What is your initial impression of the pipe operator?

```{r eval=FALSE}
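data(Loblolly)  # restore the original column names if you renamed them earlier
Loblolly %>% slice(1:5)             # first five rows
Loblolly %>% slice(seq(1, n(), 2))  # every other row
Loblolly %>% rename("Height (Ft)" = height, "Age (Yrs)" = age, "Seed Variety" = Seed)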
```
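
The pipe passes the object on its left as the first argument of the function on its right, so the piped and nested forms should give the same result. The sketch below checks this on a plain data frame copy with the hypothetical name `lob`, assuming `dplyr` is installed.

```{r eval=FALSE}
library(dplyr)
data(Loblolly)
lob <- as.data.frame(Loblolly)  # hypothetical plain data frame copy

piped  <- lob %>% slice(1:5)    # pipe form
nested <- slice(lob, 1:5)       # equivalent nested form
identical(piped, nested)

# n() is evaluated inside slice() and returns the number of rows
every_other <- lob %>% slice(seq(1, n(), 2))
nrow(every_other)
nrow(lob)
```

Comparing the last two row counts shows that the second `slice` keeps half of the rows.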